Submitted to ApJ 

Preprint typeset using LaTeX style emulateapj v. 10/09/06 



ROBUST MACHINE LEARNING APPLIED TO ASTRONOMICAL DATASETS II: QUANTIFYING 
PHOTOMETRIC REDSHIFTS FOR QUASARS USING INSTANCE-BASED LEARNING 

Nicholas M. Ball 1,2 , Robert J. Brunner 1,2 , Adam D. Myers 1,2 , 
Natalie E. Strand', Stacey L. Alberts 1 , David Tcheng 2 , Xavier Llora 2 


ABSTRACT 

We apply instance-based machine learning in the form of a k-nearest neighbor algorithm to the task 
of estimating photometric redshifts for 55,746 objects spectroscopically classified as quasars in the 
Fifth Data Release of the Sloan Digital Sky Survey. We compare the results obtained to those from 
an empirical color-redshift relation (CZR). In contrast to previously published results using CZRs, 
we find that the instance-based photometric redshifts are assigned with no regions of catastrophic 
failure. Remaining outliers are simply scattered about the ideal relation, in a similar manner to the 
pattern seen in the optical for normal galaxies at redshifts z < 1. The instance-based algorithm is 
trained on a representative sample of the data and pseudo-blind-tested on the remaining unseen data. 
The variance between the photometric and spectroscopic redshifts is σ² = 0.123 ± 0.002 (compared 
to σ² = 0.265 ± 0.006 for the CZR), and 54.9 ± 0.7%, 73.3 ± 0.6%, and 80.7 ± 0.3% of the objects are 
within Δz < 0.1, 0.2, and 0.3 respectively. We also match our sample to the Second Data Release of the 
Galaxy Evolution Explorer legacy data and the resulting 7,642 objects show a further improvement, 
giving a variance of σ² = 0.054 ± 0.005, and 70.8 ± 1.2%, 85.8 ± 1.0%, and 90.8 ± 0.7% of objects 
within Δz < 0.1, 0.2, and 0.3. We show that the improvement is indeed due to the extra information 
provided by GALEX, by training on the same dataset using purely SDSS photometry, which has a 
variance of σ² = 0.090 ± 0.007. Each set of results represents a realistic standard for application to 
further datasets for which the spectra are representative. 

Subject headings: methods: data analysis — catalogs — quasars: general — cosmology: miscellaneous 



1. INTRODUCTION 

Photometric redshifts, whether derived from empirical training 
sets or from template SEDs, are important for the use of 
astronomical objects in the study of cosmology, as they enable 
the exploration of large regions of space that are otherwise 
inaccessible. This is achieved both in cosmological volume, 
through a higher number density of objects, and in parameter 
space, through finer binning. 

After the early work of Baum (1962), Koo (1985), 
and Loh & Spillar (1986), a variety of techniques 
were developed extensively (Gwyn & Hartwick 1996; 
Lanzetta et al. 1996; Mobasher et al. 1996; Sawicki et al. 
1997; Connolly et al. 1998; Wang et al. 1998; Benítez 
2000) on galaxies in the deep, but narrow, Hubble Deep 
Field North (HDF-N; Williams et al. 1996). These 
different methods were shown to be mutually consistent and 
relatively accurate in blind-testing (Hogg et al. 1998). 

More recently, wide-field surveys with multicolor pho- 
tometry and fiber-based spectroscopy have generated 
large, uniform samples that enable photometric redshifts 
to be estimated for both galaxies and quasars. 

For galaxies in these surveys at redshifts of z < 0.4 
(e.g., Brunner et al. 1997, 2000; Tagliaferri et al.; 
Firth et al. 2003; Vanzella et al. 2004; Ball et al. 2004; 
Collister & Lahav 2004; Wadadekar 2005), a number of 
results have converged to an RMS dispersion of σ ~ 0.02 
(i.e., σ² ~ 0.0004) between spectroscopic and photometric 
redshifts, with no serious systematic effects. It should 
be emphasized, however, that galaxy photometry in these 
previous analyses has been very good, typically a few 
percent or better. Way & Srivastava (2006) show similar 
results when combining the SDSS DR2 (Abazajian et al. 
2004), GALEX GR1 (Martin et al. 2005), and the 
extended source catalog of the 2 Micron All Sky Survey 
(Skrutskie et al. 2006). 

The results at moderate redshifts have also been 
successful, with luminous red galaxies (Eisenstein et al. 
2001) in the SDSS trained with redshifts in the 2SLAQ 
survey (Cannon et al. 2006) having an RMS of σ = 0.049 
(Collister et al. 2007) for a sample at 0.4 < z < 0.7 (see 
also Padmanabhan et al.). 

At high redshifts, the number of spectra available is 
smaller and, in addition to the HDF-N, there have been 
analyses of other deep fields such as the HDF-South 
(Williams et al. 2000) and the Hubble Ultra Deep Field 
(Beckwith et al. 2006). In the latter, Coe et al. (2006) 
show an accuracy of Δz = 0.04(1 + z) for z < 6. 

In contrast to galaxies, which show small numbers 
of outliers but no significant groups of outlying objects, 
all wide-field quasar photometric redshift results 
to date (Richards et al. 2001; Budavári et al. 2001; 
Weinstein et al. 2004; Wu et al. 2004; Babbedge et al. 
2004) suffer from regions of 'catastrophic' failure, in 
which groups of objects are assigned a redshift very 
different from the true value. The first four use SDSS data, 
while the last uses the ELAIS N1 and N2 fields and 
the Chandra Deep Field North. Weinstein et al. (2004, 
hereafter W04) implement an empirical method based 
on color-redshift relations, which we use as our baseline. 
Catastrophic failures severely hamper cosmological 
investigations that use photometrically selected quasar 
samples (e.g., Myers et al. 2006, 2007a,b), particularly 
by assigning objects at z > 2 to z < 1 and vice versa, 
so eliminating these regions is important. Reasons for 
the failures, depending on the details of the way a 
particular dataset is chosen, include quasar reddening, 
degeneracy in the color-redshift relation, and superimposition of 
emission from another object, for example, an extended 
host galaxy. 

Results using a more restricted parameter space 
(Wolf et al. 2003), defined by 17 < R < 24 and 1.2 < 
z < 4.8 in the 17 filter set of the COMBO-17 survey (e.g., 
Wolf et al. 2004), have met with more success. However, 
the sample of 192 quasars is small and limited in 
angular extent, and is therefore of limited cosmological 
applicability. 

In this paper we utilize optical data from the Fifth 
Data Release of the SDSS, and near- and far-UV data 
from the Second Data Release of the Galaxy Evolution 
Explorer (GALEX; Martin et al. 2005) to assign photometric 
redshifts to quasars. Our results improve upon 
previous wide-field techniques, by eliminating regions of 
catastrophic failure, resulting in a distribution of quasar 
photometric redshifts comparable to those obtained for 
galaxies. We do not address the application of the pho- 
tometric redshifts to any parameter space beyond that 
represented by the training and blind test sets. 

2. DATA 

We utilize data from the Fifth Data Release (DR5; 
SDSS collaboration, in preparation) of the Sloan Digital 
Sky Survey (SDSS; York et al. 2000) and the Second 
Data Release (GR2) of the Galaxy Evolution Explorer 
(Martin et al. 2005). We select primary non-repeat 
observations of objects classified as quasars (specClass = 
qso or hiz_qso) in the specObj view of the SDSS DR5 
Catalog Archive Server database. The hiz_qso objects 
are at redshifts of z > 2.3 and trigger the use of the 
Lyman α finding code in the SDSS spectroscopic pipelines 
(Frieman et al., Schlegel et al., in preparation). We also 
require that the spectroscopic flags satisfy zWarning = 0 and 
zStatus > 2, and that all input magnitudes are not at 
clearly unphysical extreme values, being in the range 
0-40. The resulting sample contains 55,746 quasars. 

In addition to the SDSS sample, the SDSS objects are 
cross-matched to the primary photometric objects in the 
photoObjAll view of the GALEX GR2 database. We 
find 8,174 matches within an RA+DEC tolerance of 4 
arcsec. Of these, 532 have more than one match and 
are rejected, leaving an SDSS+GALEX sample of 7,642 
unique matches. For the GALEX objects, we require 
primary_flag = 1, a detection in both near- and far-UV 
bands, magnitudes again in the range 0-40, and the flags 
fuv_artifact and nuv_artifact to be 0. 
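
The cross-match and the rejection of multiply matched objects can be sketched as follows. This is a minimal brute-force illustration, not the database query used for the paper: the function names and toy coordinates are hypothetical, and the 4 arcsec tolerance is interpreted here as a great-circle separation, which is an assumption.

```python
import numpy as np

def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcsec (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(np.radians, (ra1, dec1, ra2, dec2))
    a = (np.sin((dec2 - dec1) / 2.0) ** 2
         + np.cos(dec1) * np.cos(dec2) * np.sin((ra2 - ra1) / 2.0) ** 2)
    return np.degrees(2.0 * np.arcsin(np.sqrt(a))) * 3600.0

def unique_matches(sdss, galex, tol_arcsec=4.0):
    """Return (i_sdss, j_galex) pairs for SDSS objects with exactly one
    GALEX counterpart within the tolerance; multiple matches are rejected."""
    pairs = []
    for i, (ra, dec) in enumerate(sdss):
        sep = angular_sep_arcsec(ra, dec, galex[:, 0], galex[:, 1])
        hits = np.flatnonzero(sep <= tol_arcsec)
        if hits.size == 1:
            pairs.append((i, int(hits[0])))
    return pairs

# Toy (RA, Dec) positions in degrees; hypothetical values for illustration
sdss = np.array([[150.0, 2.0], [150.1, 2.1], [150.2, 2.2]])
galex = np.array([
    [150.0, 2.0 + 1.0 / 3600.0],   # single counterpart of SDSS object 0
    [150.1, 2.1 + 1.0 / 3600.0],   # two candidates for SDSS object 1
    [150.1, 2.1 - 2.0 / 3600.0],
    [150.2, 2.2 + 10.0 / 3600.0],  # beyond the tolerance for SDSS object 2
])
print(unique_matches(sdss, galex))
```

For catalogs of the size used here, a spatial index would replace the explicit loop; the loop is kept only for readability.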

Throughout, the SDSS magnitudes are corrected for 
Galactic extinction using the dust maps of Schlegel et al. 
(1998) and the GALEX magnitudes using the B − V 
(e_bv) term inferred from these maps using the standard 
formula of Cardelli et al. (1989). 

The resulting samples of 55,746 and 7,642 objects form 
training sets used as input for the learning algorithms. 
The full set of object attributes for the SDSS sample 
consists of 16 training features. These are the colors 
u − g, g − r, r − i, and i − z, where the SDSS bands u, 
g, r, i, and z are given for each of the four magnitude 
types PSF, fiber, Petrosian, and model (Stoughton et al. 
2002). For SDSS+GALEX, we add the colors fuv − nuv 
and nuv − u, where u is given in each of the four SDSS 
magnitude types, resulting in 21 training features. 
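
As a check on the feature counts, the two feature sets can be enumerated explicitly. The short sketch below is illustrative only; the feature-name strings are hypothetical labels, not column names from the SDSS or GALEX databases.

```python
BANDS = ["u", "g", "r", "i", "z"]
MAG_TYPES = ["psf", "fiber", "petro", "model"]

def sdss_features():
    """Adjacent-band colors (u-g, g-r, r-i, i-z) in each magnitude type."""
    return [f"{m}_{a}-{b}"
            for m in MAG_TYPES
            for a, b in zip(BANDS[:-1], BANDS[1:])]

def sdss_galex_features():
    """SDSS colors plus fuv-nuv and nuv-u, with u in each SDSS magnitude type."""
    return sdss_features() + ["fuv-nuv"] + [f"nuv-{m}_u" for m in MAG_TYPES]

print(len(sdss_features()), len(sdss_galex_features()))  # 16 21
```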

In addition to the SDSS and SDSS+GALEX datasets, 
we also analyze the SDSS+GALEX sample of objects, 
but using only SDSS features. This dataset, referred to 
as GALEX-SDSS-only, enables us to quantify the level 
of improvement in SDSS+GALEX seen from the addi- 
tion of the GALEX UV features, as opposed to possible 
improvement due to the sample only containing quasars 
that appear in both SDSS and GALEX. 

3. ALGORITHMS 

We implement instance-based learning on the SDSS, 
SDSS+GALEX and GALEX-SDSS-only datasets. The 
results are compared to those on the same data for an 
empirical color-redshift relation containing full probabil- 
ity density functions (Strand, in preparation). We also 
study the utility of subsets of the full set of training fea- 
tures using genetic algorithms. 

The machine learning is implemented in the Java 
environment Data-to-Knowledge (Welge et al. 1999). It is 
optimized through use of nationally peer-reviewed allo- 
cated time on the Xeon Linux cluster Tungsten at the 
National Center for Supercomputing Applications. This 
enables an extensive exploration of the parameter space 
describing the training features of the objects and the 
settings of the learning algorithms. 

3.1. Instance-based Learning 

Instance-based learning (IB; e.g., Aha et al. 1991; 
Witten & Frank 2000; Hastie et al. 2001) is a powerful 
class of empirical machine learning methods that 
to date has not been extensively utilized on large 
astronomical datasets due to its computational intensity. 
Two examples where the method has been used are 
Budavári et al. (2001) and Csabai et al. (2003), who 
both use the method on the SDSS Early Data Release 
(EDR; Stoughton et al. 2002). However, they only 
utilize single nearest neighbors, and in addition the DR5 
dataset analyzed herein is approximately 15 times the 
size of the EDR. Here, through the use of Tungsten (§3 
above), we are able to realize the full potential of the 
algorithm, via the use of the k-nearest neighbor method 
(e.g., Cover & Hart 1967). 

In its simplest form, the 'training' of the algorithm is 
trivial, and involves simply memorizing the positions of 
each of the examples in the training set. For each object 
in the testing set, the nearest training example is then 
found, and the predicted value, either a classification or 
a continuous value, is taken to be that of the training 
example. Thus the computational expense is incurred at 
the time of classification, as a large number of distance 
calculations must be performed. However, the method is 
powerful because it uses all of the information available 
in the training set, rather than a model of the training 
set as is typically used by most other learning algorithms. 
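
The simplest form described above, a single nearest neighbor, can be sketched in a few lines. This is a toy illustration with hypothetical two-color data, not the Data-to-Knowledge implementation used for the results.

```python
import numpy as np

def nearest_neighbor_predict(train_x, train_z, test_x):
    """Assign each test object the redshift of its nearest training example."""
    # Pairwise Euclidean distances, shape (n_test, n_train)
    dist = np.linalg.norm(test_x[:, None, :] - train_x[None, :, :], axis=2)
    return train_z[np.argmin(dist, axis=1)]

train_x = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])  # toy colors
train_z = np.array([0.5, 1.5, 3.0])                        # toy redshifts
test_x = np.array([[0.1, -0.1], [2.6, 2.9]])
print(nearest_neighbor_predict(train_x, train_z, test_x))
```

"Training" is just storing `train_x` and `train_z`; all of the cost sits in the distance matrix computed at prediction time.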

There are a number of simple refinements to this 
method, which in practice result in large improvements in 
performance: (1) Instead of the nearest neighbor to the 
testing example, the k nearest neighbors can be found, 
and the distances weighted using a predictive integration 
function to produce a weighted output. This function, d, 
takes the form 

$$ d = \sum_{i=1}^{k} \frac{1}{x_i^{p}}, $$

where the x_i are the Euclidean distances to the neighbors, 
and the exponent p can take on any positive value, 
typically but not necessarily an integer. (2) The input 
features can be standardized such that the mean and 
variance of each are 0 and 1 respectively. This stops 
the training being dominated by features with larger 
numerical values or spreads. Alternatively, one could also 
normalize the range of features to be 0-1. (3) Objects 
in the training set can be allocated to collective regions 
of parameter space, which can considerably reduce the 
required number of distance calculations. 
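
Refinements (1) and (2) can be sketched as follows. The sketch assumes that the weighted output is the mean of the neighbors' redshifts with weights 1/x_i^p, normalized by their sum d; the text does not spell out this normalization, so treat it as an assumption. The function names, the small floor `eps` on the distances, and the toy data are likewise illustrative.

```python
import numpy as np

def standardize(train_x, test_x):
    """Scale features to zero mean and unit variance using training statistics."""
    mean, std = train_x.mean(axis=0), train_x.std(axis=0)
    return (train_x - mean) / std, (test_x - mean) / std

def knn_predict(train_x, train_z, test_x, k=3, p=2.0, eps=1e-12):
    """Inverse-distance-weighted mean of the k nearest training redshifts.
    The weights are 1 / x_i**p and their sum is the normalization d."""
    dist = np.linalg.norm(test_x[:, None, :] - train_x[None, :, :], axis=2)
    idx = np.argsort(dist, axis=1)[:, :k]          # indices of the k nearest
    x = np.take_along_axis(dist, idx, axis=1)      # their distances x_i
    w = 1.0 / np.maximum(x, eps) ** p              # eps guards against x_i = 0
    d = w.sum(axis=1)
    return (w * train_z[idx]).sum(axis=1) / d

train_x = np.array([[0.0], [1.0], [2.0], [10.0]])  # one toy feature
train_z = np.array([0.0, 1.0, 2.0, 10.0])
test_x = np.array([[0.5]])
tr, te = standardize(train_x, test_x)
print(knn_predict(tr, train_z, te, k=2, p=1.0))    # midway between z = 0 and z = 1
```

Larger p concentrates the weight on the closest neighbors, so p and k trade off against each other when both are optimized.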

Of the methods described, we implement (1) and (2), 
but not (3) as we wish to use the full information avail- 
able in the training data. We optimize the values of k 
and p and standardize all training features. Further re- 
finements can also be made for objects which have non- 
continuous values such as a classification or missing data. 
However, in this paper all values considered, i.e., the 
training features and the spectroscopic and photometric 
redshifts, are continuous. 

3.2. Color-Redshift Relation 

We have implemented the color-redshift relation (CZR) 
method of Weinstein et al. (2004) on the same data as 
the IB. This enables a direct comparison of the per- 
formance of the two methods. The CZR establishes 
an empirical relation between the spectroscopic redshifts 
and the colors of the training set. The maximum like- 
lihood redshift Probability Density Function (PDF) is 
then found for each object in the test set. 

3.3. Genetic Algorithms 

The methods above select and optimize a learning al- 
gorithm for a given set of training features. However, it 
is possible that different subsets of the features available 
will produce better results. In particular, the results for 
instance-based learning can be made worse by noise in 
the training set or by irrelevant training features. To 
explore this possibility, we implement a binary genetic 
algorithm on the training feature sets. 

A genetic algorithm (GA; e.g., Holland 1975; Goldberg 
1989; Haupt & Haupt 1998) mimics evolution, in the 
sense that the most successful individuals are those that 
are best adapted for the task at hand. We implement 
the binary genetic algorithm, in which each individual is 
a string of 0s and 1s which represents whether or not to 
use a particular input feature (in our case the 16 colors). 
An initial population of random individuals is created 
and the IB is run using the features selected. The re- 
sult, in this case the variance between photometric and 
spectroscopic redshift, is the fitness of that individual. 
The individuals and their fitnesses are then combined to 
produce new individuals, and those with higher fitnesses 
are favored. In principle, a good approximation to the 
best set of features to use as the training set should be 
selected with this approach. 

The combination involves identifying the best individuals 
to breed via tournament selection, in which a specified 
number of individuals from the population are selected 
and the best is put in the mating pool to be combined 
with other individuals. Two individuals are combined 
using one-point crossover, in which a segment of one is 
swapped with that of the other. To more fully explore 
the parameter space and prevent the algorithm from 
converging too rapidly on a local minimum, a probability of 
mutation is introduced on the newly created individuals 
before they are processed. This is simply the probability 
that a 0 becomes a 1, or vice versa. 
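
The three operators described above (tournament selection, one-point crossover, and bit-flip mutation) can be sketched as below. This is a generic binary GA, not the code used for the paper: the tournament size, mutation probability, the way the best individual is carried forward, and the toy cost function are all assumptions. In the real application the cost of an individual would be the blind-test variance of the IB run on the selected features, with lower values being better.

```python
import random

def tournament(pop, fitness, size, rng):
    """Pick `size` random individuals; the one with the lowest cost wins."""
    contenders = rng.sample(range(len(pop)), size)
    return pop[min(contenders, key=lambda i: fitness[i])]

def one_point_crossover(a, b, rng):
    """Swap the tails of two bit strings at a random cut point."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(ind, p_mut, rng):
    """Flip each bit (0 -> 1 or 1 -> 0) with probability p_mut."""
    return [bit ^ 1 if rng.random() < p_mut else bit for bit in ind]

def evolve(cost, n_features, n_individuals, n_iterations, p_mut=0.02, seed=0):
    """Return the lowest-cost feature mask seen during the run."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_individuals)]
    best = min(pop, key=cost)
    for _ in range(n_iterations):
        fitness = [cost(ind) for ind in pop]
        children = []
        while len(children) < n_individuals:
            a = tournament(pop, fitness, 3, rng)
            b = tournament(pop, fitness, 3, rng)
            for child in one_point_crossover(a, b, rng):
                children.append(mutate(child, p_mut, rng))
        pop = children[:n_individuals]
        best = min(pop + [best], key=cost)
    return best

# Hypothetical stand-in for the blind-test variance: distance to a fixed mask
TARGET = [1, 0] * 8
def cost(ind):
    return sum(x != t for x, t in zip(ind, TARGET))

best = evolve(cost, n_features=16, n_individuals=39, n_iterations=58)
print(best, cost(best))
```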
An approximate number of individuals to use is given 
by 

$$ n_{\rm in} \sim 2 n_f \log(n_f), $$

where n_f is the number of features. Here, for the 
SDSS and GALEX-SDSS-only, n_f = 16, and for 
SDSS+GALEX, n_f = 21. Hence, n_in ~ 39 and 56 
respectively for these two values of n_f. The algorithm 
converges, i.e., finds the best individual and hence the best 
training set, in 

$$ n_{\rm it} \sim \alpha n_f \log(n_f) $$

iterations, where α is a problem-dependent constant. 
Generally α > 3, giving an expected value for our data 
of n_it ~ 58 for n_f = 16, and n_it ~ 83 for n_f = 21. We 
employ this number of iterations with larger numbers of 
individuals (see footnote 4) to be sure that the algorithm has converged. 
Further information on genetic algorithm design can be 
found in, e.g., Goldberg (2002). 
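
The quoted population sizes and iteration counts can be reproduced directly from the two scaling relations, writing n_f for the number of features. The base of the logarithm is not stated in the text; the numbers quoted (39, 56, 58, 83) are recovered with base-10 logarithms and α = 3, whereas natural logarithms would give roughly 89 individuals for n_f = 16. The function names below are illustrative.

```python
import math

def n_individuals(n_f):
    """Approximate population size: 2 n_f log10(n_f)."""
    return 2 * n_f * math.log10(n_f)

def n_iterations(n_f, alpha=3.0):
    """Approximate iterations to convergence: alpha n_f log10(n_f)."""
    return alpha * n_f * math.log10(n_f)

for n_f in (16, 21):
    print(n_f, round(n_individuals(n_f)), round(n_iterations(n_f)))
# 16 39 58
# 21 56 83
```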

Our GA is implemented on the IB for each of the SDSS, 
SDSS+GALEX and GALEX-SDSS-only datasets. The 
settings of these algorithms are fixed for the duration of 
the GA iteration. It is possible in principle to combine 
the optimization of the learning algorithm and the fea- 
ture set; however, we defer this analysis to a later paper. 

3.4. Training and Quality of Redshifts 

The IB and CZR are supervised learning algorithms — 
they are given a training set of objects and attempt to 
minimize a cost function which describes the quality of 
the predictions on a separate testing set. 

For IB, the cost function is given by the variance 
between the photometric and spectroscopic redshifts for 
objects with spectra: 

$$ \sigma^2 = \langle (\Delta z)^2 \rangle - \langle \Delta z \rangle^2, $$

where Δz = |z_spec − z_phot|, z_spec is the spectroscopic 
redshift value, and z_phot is the photometric redshift 
prediction made by the learning algorithm. The second term 
in the variance equation is small. 
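
The cost function, together with the percentages of objects within fixed |z_spec − z_phot| thresholds that are quoted throughout, can be computed as in the following sketch. The function name and the four toy redshift pairs are hypothetical.

```python
import numpy as np

def photoz_metrics(z_spec, z_phot, thresholds=(0.1, 0.2, 0.3)):
    """Variance <dz^2> - <dz>^2 with dz = |z_spec - z_phot|, and the
    percentage of objects with dz below each threshold."""
    dz = np.abs(np.asarray(z_spec) - np.asarray(z_phot))
    variance = np.mean(dz ** 2) - np.mean(dz) ** 2
    fractions = {t: 100.0 * np.mean(dz < t) for t in thresholds}
    return variance, fractions

z_spec = np.array([0.5, 1.0, 1.5, 2.0])
z_phot = np.array([0.55, 1.25, 1.5, 1.0])   # the last object is an outlier
variance, fractions = photoz_metrics(z_spec, z_phot)
print(variance, fractions)
```

In this toy case the single outlier with dz = 1 contributes almost all of the variance, which illustrates why the statistic is sensitive to the outlier population.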

The value of the variance is dominated by the outliers. 
However, in our case, this is a desirable property, be- 
cause it is these objects which we wish to pull in the 
most toward the correct values. The dominance of the 
outliers renders the variance susceptible to variations in 
this population. We therefore quote errors on all of our 
blind test variances, derived from splitting the popula- 
tion using multiple random seeds (see below). For the 
CZR, the cost function is the likelihood of the PDF. 

Instance-based learning, like any supervised machine 
learning algorithm, is susceptible to incompleteness and 
noise in the training set. At the present time, the SDSS 
DR5 is by far the largest and most homogeneous quasar 
dataset available, and it has a high completeness (e.g., 
Vanden Berk et al. 2005). Other available datasets are 
either not as deep, smaller (e.g., Croom et al. 2004), or 
deeper but orders of magnitude smaller (e.g., Wolf et al. 
2004). One could prune noisy exemplars; however, it is 
difficult to meaningfully define what is a noisy or sparsely 
populated region of parameter space, and pruning 
particular regions could introduce new and poorly defined 
biases. The use of multiple nearest neighbors smooths 
the noise, and the blind test results address both 
incompleteness and noise by presenting realistic results on 
unseen data. 

Footnote 4: 300 individuals for SDSS+GALEX, and 200 for the other two datasets. 
These numbers were selected for other tests not reported here, and 
simply strengthen the null result. 

The distance measure, parameterized by the number of 
nearest neighbors and the distance weighting, assumes that the 
input training features are uncorrelated. However, given 
that we repeat the same four colors in four magnitude 
types, and in addition that a set of features is always 
derived from a particular object, the input features will 
always be correlated, both in magnitude type (e.g., PSF 
u − g is correlated to fiber u − g, and so on), and in 
color (e.g., PSF u − g is correlated to PSF g − r, and 
so on). Correlated input features are therefore 
unavoidable; we feel, however, that our algorithmic approach is 
acceptable because we select the parameters to produce 
the optimal blind test result. 

Different splits of the training set are investigated at 
various points in the learning process, giving four 
adjustable ratios: (1) r_train is the ratio between the data 
used as the training set and the data used for testing the 
algorithm's performance according to the cost function to adjust the 
final model settings (for IB there is no adjustment, so the 
ratio just affects the performance through the information 
available). (2) r_blind is the ratio of the whole set 
of data used in training and testing to that unseen by 
the algorithm until it is applied, as it would be to new 
data from another survey; this is the pseudo-blind test. 
(3) r_bagging is the ratio of the data used in each bagged 
model to the rest of the training data, where the training 
data is r_train of the whole dataset. (4) r_cross_val is 
similar, but for cross-validation. The latter is distinguished 
from bagging because it takes different random subsamples 
of the whole r_train training and 1 − r_train testing set, 
whereas bagging subsamples r_train. 
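
The nesting of the blind and training splits can be illustrated with a short sketch. The helper name, the use of NumPy's random generator, and the seed value are assumptions; the seeds here are unrelated to those used for the results in the paper. With an 80:20 ratio, the 55,746-object SDSS sample yields a pseudo-blind set of 11,149 objects, and the 7,642-object SDSS+GALEX sample yields 1,528.

```python
import numpy as np

def split_indices(n, ratio=0.8, seed=0):
    """Randomly split n objects into two index sets in the ratio ratio:(1 - ratio)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    cut = int(round(ratio * n))
    return perm[:cut], perm[cut:]

n_total = 55746
# Blind split at 80:20: `blind` is never seen until the final test
seen, blind = split_indices(n_total, ratio=0.8, seed=8)
# Training split at 80:20: applied only within the seen data
train_sel, test_sel = split_indices(seen.size, ratio=0.8, seed=8)
train, test = seen[train_sel], seen[test_sel]
print(seen.size, blind.size, train.size, test.size)
```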

The value for which we quote results for all of these 
ratios is 80:20. For application to new data not used 
here, the value of r_train would be 100%, to maximize the 
information available. This is the standard σ² reported 
in the literature for CZR techniques, but its value would 
be meaningless for instance-based approaches. 

For IB, the variances obtained are quoted from the 
pseudo-blind test, as this represents the most realistic 
standard of performance available from within the SDSS 
and GALEX datasets to be expected on new data. The value 
of r_blind is always such that the training data is 
representative of the full dataset. 

We quote the mean and standard deviation of the best 
variance from ten training runs with differing random 
seeds for r_blind. Each run produces a grid of models with 
the range 1 < k < 50 and 1 < p < 10, where k is the 
number of nearest neighbors and p the exponent in the 
distance-weighting function (§3.1). Integral values of k 
and p were used, although this is not a requirement. We 
use positive values of p as negative values would result 
in objects other than the nearest neighbor being given 
the highest weighting, which would be unphysical as 
neighbors at increasingly large k would be given an ever higher 
weight. We investigated bagging and cross-validation 
using values of r_bagging and r_cross_val of 80:20 and 50:50, 
but these were not found to be necessary for IB. Other 
measures, such as Δz/(1 + z), and the percentage of 
objects within Δz < 0.1, 0.2, and 0.3 are also given for 
comparison to other work. We do not quote any results in 
which there is any overlap between the training and 
testing data. 
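
The procedure of taking the best grid model for each of ten random splits and then quoting the mean and standard deviation can be sketched as follows. This is an illustration on synthetic data, not the Data-to-Knowledge implementation: the grid is taken as k = 1 to 50 and p = 1 to 10 inclusive (an assumption about the endpoints), the prediction is assumed to be the inverse-distance-weighted mean of the neighbor redshifts, and all function names are hypothetical.

```python
import numpy as np

def variance(z_spec, z_phot):
    dz = np.abs(z_spec - z_phot)
    return float(np.mean(dz ** 2) - np.mean(dz) ** 2)

def best_on_grid(x, z, seed, k_values, p_values, ratio=0.8, eps=1e-12):
    """Best (variance, k, p) on one random training / pseudo-blind split."""
    perm = np.random.default_rng(seed).permutation(len(z))
    cut = int(round(ratio * len(z)))
    tr, bl = perm[:cut], perm[cut:]
    # Distances from each blind object to every training object, sorted once
    dist = np.linalg.norm(x[bl][:, None, :] - x[tr][None, :, :], axis=2)
    order = np.argsort(dist, axis=1)
    sorted_dist = np.maximum(np.take_along_axis(dist, order, axis=1), eps)
    sorted_z = z[tr][order]
    best = None
    for k in k_values:
        for p in p_values:
            w = 1.0 / sorted_dist[:, :k] ** p
            z_phot = (w * sorted_z[:, :k]).sum(axis=1) / w.sum(axis=1)
            result = (variance(z[bl], z_phot), k, p)
            if best is None or result < best:
                best = result
    return best

# Synthetic stand-in for standardized colors and redshifts
rng = np.random.default_rng(42)
x = rng.normal(size=(300, 4))
z = np.abs(x[:, 0]) + 0.5 * x[:, 1] ** 2 + 0.1 * rng.normal(size=300)

results = [best_on_grid(x, z, seed, range(1, 51), range(1, 11))
           for seed in range(10)]
best_var = np.array([r[0] for r in results])
print(f"variance = {best_var.mean():.3f} +/- {best_var.std():.3f}")
```

Sorting the distances once per split lets every (k, p) model on the grid reuse the same neighbor lists, which is where most of the computational cost lies.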

The comparative CZR results were obtained by using 
a 10-fold bootstrapped pseudo-blind test, again in the 
ratio r_blind = 80:20. 

4. RESULTS 

We now describe results for the full SDSS DR5, SDSS 
DR5 + GALEX GR2, and GALEX-SDSS-only datasets, 
all of which are summarized in Table 1. 

4.1. SDSS DR5 

We found that the ideal parameters are 22 ± 5 nearest 
neighbors (NN) and a distance weighting (DW) of 
3.7 ± 0.5. In the pseudo-blind test on the unseen 20% of 
the data, the best variance between the photometric and 
spectroscopic redshifts is 0.123 ± 0.002. A comparison 
between the photometric and spectroscopic redshifts is 
shown in Figure 1, and the effect of varying the NN and 
the DW for the pseudo-blind test is shown in Figure 2. 
We find that 54.9 ± 0.7%, 73.3 ± 0.6%, and 80.7 ± 0.3% 
of the objects are within Δz < 0.1, 0.2, and 0.3, 
respectively. The variance weighted by redshift is σ² = 
0.034 ± 0.001 and the mean Δz/(1 + z) = 0.095 ± 0.001. 

Because the values of NN and DW used here are 
discrete (in principle they can be continuous, but that was 
not attempted), the results presented in Figure 1 were 
obtained with the values of NN, DW and the blind test 
set random seed that gave the best variance in its grid 
that was closest to the mean. Here, these values are 
NN = 22, DW = 4 and a random seed of 8 (for the seeds 
we used the integers 0 to 9). The variance is 0.1240, 
which is consistent with the mean variance quoted. 

Our key result, shown in Figure 1, is the absence of 
regions of catastrophic failure: there is no upturn in a 
histogram of Δz values at large Δz, just a smooth decline 
such that few objects are outliers. This is in contrast to 
previous results for quasar photometric redshifts, which, 
while showing a comparable spread of objects with low 
Δz, show outlying regions of objects with high Δz. The 
scattering of outliers obtained by the IB is similar in 
form to that seen in other studies for normal galaxies at 
redshifts of z < 1 (see, for example, Figure 3 of Ball et al. 
2004 for SDSS Main Sample galaxies, which have a mean 
redshift of z ~ 0.1), although there is still structure seen 
in Figure 1, especially at z_spec < 1 and z_spec ~ 2.2. 

We have also implemented the methods of W04 on 
the SDSS DR3, without removing the reddened quasars 
(Strand, in preparation). Here we apply that method to 
the SDSS DR5 dataset as a direct comparison between 
the empirical CZR and the IB. We find that the CZR has 
slightly narrower dispersion than the IB, with Δz 
percentages of 63.9 ± 0.3%, 80.2 ± 0.4%, and 85.7 ± 0.3% within 
Δz < 0.1, 0.2, and 0.3. However, as shown in Figure 3, it 
still shows regions of catastrophic failure. The variance 
is therefore significantly higher, at σ² = 0.265 ± 0.006. 
We again plot the run from the ten with the closest 
variance to the mean. In this case this was the final run of 
the ten, with σ² = 0.2653. 

Previous results using empirical CZRs show a similar 
pattern. For example, Figure 4 of W04 shows regions 
of quasars at 0 < z_phot < 1 and 1.5 < z_phot < 4.5 over 
the spectroscopic redshift range 0 < z_spec < 4. Similar 
results are seen in Budavári et al. (2001), Richards et al. 
(2001), and Wu et al. (2004). 

4.2. SDSS DR5 + GALEX GR2 

Adding the GALEX data significantly improves the 
results, as shown in Figures 4 and 5. Here we obtain 
a variance of 0.054 ± 0.005 for the pseudo-blind test, 
70.8 ± 1.2%, 85.8 ± 1.0%, and 90.8 ± 0.7% of objects within 
Δz < 0.1, 0.2, and 0.3, a redshift-weighted variance of 
σ² = 0.014 ± 0.002, and a mean Δz/(1 + z) = 0.060 ± 0.003. 

The number of nearest neighbors and distance weight- 
ing are 17±5 and 4.4±0.8 respectively. A higher distance 
weighting is expected due to the greater dimensionality 
of the training feature space (21 colors instead of 16) 
compared to the SDSS dataset. 

The exact values of NN and DW that are plotted in 
Figure 4 are chosen in the same manner as for the SDSS, 
and are NN = 12, DW = 5 and a random seed of 3. The 
variance is 0.0521. 

To show that the improvement is not simply due to the 
smaller set of objects which appear in both surveys (for 
example, these objects may be brighter quasars in the 
SDSS with better photometry), we also applied the SDSS 
training procedure to the cross-matched sample. This 
gives better results than the SDSS sample, but they are 
still significantly worse than SDSS+GALEX. The variance 
is σ² = 0.090 ± 0.007, and the other results are as 
seen in Table 1. 

The SDSS results extend deeper than those matched 
with GALEX, to z < 6 rather than z < 3.5. The lack 
of quasars in the 'redshift desert' at z > 2.2, seen in 
Figure 4, is caused by the Lyman break in the spectrum at 
a restframe wavelength of 912 Å being shifted out of the 
UV. 

The CZR results for SDSS+GALEX also improve over 
those from the full SDSS dataset. Here 74.9 ± 1.4%, 86.9 ± 
0.6%, and 91.0 ± 0.8% of the objects are within Δz < 
0.1, 0.2, and 0.3. This is still slightly better than IB for 
Δz < 0.1 and Δz < 0.2, but is the same for Δz < 0.3. 

4.3. Genetic Algorithms 

The application of the genetic algorithms on the SDSS, 
SDSS+GALEX and GALEX-SDSS-only datasets 
converged on the use of approximately half of the training 
parameters, but the variance was not significantly 
different from that from using the full set of training features. 
The full sets were therefore used throughout. The result 
indicates that there is some redundancy in the training 
features, which is expected given that they are measuring 
the four colors four different times, just through different 
apertures. 

5. DISCUSSION 

Although the results here represent an important step 
in the sense that there are no regions of catastrophic 
failure, further improvement is still possible. In 
particular: (1) The input object parameter distributions may 
be generalized into the form of a PDF for each object, 
which can be propagated through the learning process, 
to make more explicit those objects for which the 
redshift is less certain, to take into account the error on 
each parameter, and to output a PDF for each object 
instead of a scalar value. (2) The absence of catastrophic 
failures in the instance-based approach and the lower 
low-Δz dispersion of the CZR can be combined into a new 
learning algorithm. The IB is in fact able to obtain similar 
results to the CZR (i.e., an approximately 5% narrower 
dispersion and regions of catastrophic failure instead of a 
spread of objects), by using the single nearest neighbor 
instead of k nearest neighbors. (3) Other multiwavelength 
training data, such as infrared data from UKIDSS 
(Lawrence et al. 2006) and Spitzer (Werner et al. 2004), 
can be included in the training process. 

We also obtained quasar photometric redshifts using 
decision trees, as used in Ball et al. (2006) for star-galaxy 
separation. The variances obtained were generally 
comparable to, but slightly worse than, those for 
instance-based learning, and are therefore not reported here. 

6. CONCLUSIONS 

We apply instance-based machine learning to 55,746 
objects spectroscopically classified as quasars in the Fifth 
Data Release of the Sloan Digital Sky Survey (SDSS), 
and to 7,642 objects cross-matched from this sample to 
the Second Data Release of the Galaxy Evolution Explorer 
legacy data (SDSS+GALEX). 

The algorithm is able to assign photometric redshifts 
to quasars without regions of catastrophic failure, unlike 
previously published results. This will enable samples of 
quasars to be constructed for cosmological studies with 
minimal contamination from objects at severely incorrect 
redshifts. 

We obtain, for the same data, empirical color-redshift 
relations with full probability distributions and find that 
these are similar to previous results in the literature. 

For SDSS, we find a photometric-to-spectroscopic 
variance of 0.123 ± 0.002 for a sample of the data not used 
in the training. For SDSS+GALEX, this improves to 
0.054 ± 0.005. Using purely SDSS on the latter dataset 
(GALEX-SDSS-only), the variance is 0.090 ± 0.007. 
Hence the improvement results from the extra UV 
information provided by GALEX and not the reduced sample 
size, better photometry, or lower redshifts. The 
percentages of objects within Δz < 0.1 are 54.9 ± 0.7%, 
70.8 ± 1.2%, and 62.0 ± 1.4% for SDSS, SDSS+GALEX, 
and GALEX-SDSS-only, respectively. 

Each set of results represents a realistic standard for 
application to further datasets of which the spectra are 
representative. 

We thank the referee for a prompt and useful report 
which improved the paper, and Kumara Sastry of the 
Illinois Genetic Algorithms Laboratory for a clarification 
on our use of specific genetic algorithms. 

The authors acknowledge support from NASA through 
grants NN6066H156 and 05-GALEX05-0036, from Mi- 
crosoft Research, and from the University of Illinois. The 
authors made extensive use of the storage and computing 
facilities at the National Center for Supercomputing 
Applications. 

TABLE 1 

Summary of photometric redshift samples described in this paper. 

Dataset          Method  Variance       Variance/(1+z)  Mean Δz/(1+z)  %Δz < 0.1   %Δz < 0.2   %Δz < 0.3 
SDSS             IB      0.123 ± 0.002  0.034 ± 0.001   0.095 ± 0.001  54.9 ± 0.7  73.3 ± 0.6  80.7 ± 0.3 
SDSS+GALEX       IB      0.054 ± 0.005  0.014 ± 0.002   0.060 ± 0.003  70.8 ± 1.2  85.8 ± 1.0  90.8 ± 0.7 
GALEX-SDSS-only  IB      0.090 ± 0.007  0.022 ± 0.001   0.081 ± 0.003  62.0 ± 1.4  78.9 ± 1.0  85.2 ± 1.2 
SDSS             CZR     0.265 ± 0.006  0.079 ± 0.003   0.115 ± 0.002  63.9 ± 0.3  80.2 ± 0.1  85.7 ± 0.3 
SDSS+GALEX       CZR     0.136 ± 0.015  0.031 ± 0.006   0.071 ± 0.005  74.9 ± 1.4  86.9 ± 0.6  91.0 ± 0.8 
GALEX-SDSS-only  CZR     0.158 ± 0.013  0.041 ± 0.004   0.081 ± 0.004  74.1 ± 0.8  86.2 ± 0.7  89.7 ± 0.6 


Fig. 1. Contour plot of quasar photometric redshifts assigned 
by the instance-based learner versus spectroscopic redshifts for the 
SDSS DR5 pseudo-blind testing sample of 11,149 of 55,746 quasars 
described in the text. For contouring, the objects are placed in 
bins of 0.05 in redshift, although the values on both axes are 
continuous. The variance between the two measures over the whole 
redshift range is 0.123 ± 0.002. Compared to Figure 3, there are 
no regions of 'catastrophic' failure, in which objects are assigned 
a very different redshift to the true value, just a smoothly 
declining spread of outliers. There are no objects outside the range of 
redshifts plotted. 

Fig. 2. Effect of varying the number of nearest neighbors 
and the distance weighting of the instance-based learner for the 
pseudo-blind test on the SDSS DR5 dataset, showing the mean 
from ten different training to pseudo-blind test splits of the data 
with a varying random seed. The model which gives the lowest 
variance is marked with 1σ error bars. 

Fig. 3. As Figure 1, but showing the results for the CZR 
photo-zs. The regions of catastrophic failure are seen, and the overall 
variance is σ² = 0.265 ± 0.006. The values of z_phot resulting from 
this method are in bins of width 0.05. Here, a uniformly distributed 
random offset up to ±0.025 has been added to the values of z_phot 
for clarity. 

Fig. 4. As Figure 1, but showing the results for 1,528 of 7,642 
quasars present in the SDSS DR5 cross-matched to the GALEX 
GR2. The variance is improved to σ² = 0.054 ± 0.005. 

Fig. 5. As Figure 2, but for the SDSS+GALEX dataset shown 
in Figure 4. 